Text Mining - Tidy Text - Shakespeare and Marlowe

(adapted from J. Silge and D. Robinson's Text Mining with R: a Tidy Approach)

In this notebook, you will be introduced to the basics of tidy text mining using the tidytext R library, which shares syntax with H. Wickham's popular tidyverse suite of packages (including ggplot2, a powerful graphics library).

The main datasets that you will work with are

  • a selection of Shakespeare's plays,
  • a selection of Christopher Marlowe's plays, and
  • the AP Recaps data set from the TMNLP 1 notebook.

SECTIONS

  1. Pipeline Operator %>%
  2. Tidy Text Structure
  3. Tidy Text Flow
  4. Tidy Text Analysis
  5. Comparison with Marlowe
  6. Comparison with a Modern Text
  7. $n$-grams
  8. Exercises

For this notebook, we do not aggregate all the library() calls into a single block at the top. As an aside, the order of these calls can affect the analysis, as some libraries mask functions from other libraries.


Back to top

PIPELINE OPERATOR %>%

R is a functional language, which tends to produce deeply nested function calls that make code difficult to read. The pipeline operator %>% and the package dplyr can be used to remedy the situation.

Hadley Wickham provided an example in 2014 to illustrate how it works:

hourly_delay <- filter( summarise( group_by( filter( flights, !is.na(dep_delay) ), date, hour ), delay = mean(dep_delay), n = n() ), n > 10 )

Take some time to figure out what is supposed to be happening here.

The pipeline operator eschews nesting function calls in favor of passing data from one function to the next:

hourly_delay <- flights %>% filter(!is.na(dep_delay)) %>% group_by(date, hour) %>% summarise( delay = mean(dep_delay), n = n() ) %>% filter(n > 10)

The beauty of this approach is that it can be 'read' aloud to discover what the block of code is meant to do.

The flights data frame is

  1. filtered to remove missing values of the dep_delay variable,
  2. grouped by hour within date,
  3. summarized within each group (the mean delay and the number of flights n), and
  4. filtered again to keep only those hours with more than 10 flights (n > 10).

The pipeline rules are simple: the object on the left hand side is passed as the first argument to the function on the right hand side.

  • data %>% function is the same as function(data)
  • data %>% function(arg=value) is the same as function(data, arg=value)
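A minimal, self-contained illustration of this equivalence (any package that provides %>%, such as magrittr or dplyr, will do):

```r
library(magrittr) # provides %>% (dplyr re-exports it)

nested <- sum(sqrt(c(1, 4, 9)))           # nested function calls
piped  <- c(1, 4, 9) %>% sqrt() %>% sum() # the same computation, piped

nested # 6
piped  # 6
```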

References: https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html


Back to top

TIDY TEXT STRUCTURE

In a Text Mining context, text typically comes in one of 4 formats:

  1. string (character vectors)
  2. corpus (raw strings, together with document metadata)
  3. document-term matrix (rows: documents, columns: terms, entries: weights/statistics)
  4. tidy text

Tidy data has specific structure:

  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table

Tidy text is a table with one token (single word, $n$-gram, sentence, paragraph) per row. As we've seen before, the text has to be tokenized into such commonly-used units.

Consider the following haiku by master Matsuo Basho:

In [1]:
haiku <- c('In the twilight rain',
            'these brilliant-hued hibiscus -',
            'A lovely sunset')
haiku
Out[1]:
  1. 'In the twilight rain'
  2. 'these brilliant-hued hibiscus -'
  3. 'A lovely sunset'

Let's turn it into a data frame:

In [2]:
library(dplyr)
haiku.df <- data.frame(text=haiku, stringsAsFactors = FALSE) # stringsAsFactors = FALSE keeps the text as character strings so that the words can be tokenized
haiku.df
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Out[2]:
text
In the twilight rain
these brilliant-hued hibiscus -
A lovely sunset
In [3]:
library(tidytext)
haiku.df %>% unnest_tokens(word,text) # unnest the tokens in haiku.df: the output column is word, the input column is text
Out[3]:
    word
1   in
1.1 the
1.2 twilight
1.3 rain
2   these
2.1 brilliant
2.2 hued
2.3 hibiscus
3   a
3.1 lovely
3.2 sunset

unnest_tokens() separates the tokens (words, in this example), strips away the punctuation, converts to lowercase, and records each token's line and position (as row names).
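By default, unnest_tokens() lowercases its tokens; this can be turned off with the to_lower argument. A quick sketch:

```r
library(dplyr)
library(tidytext)

df <- tibble(text = "The quick, BROWN fox!")

df %>% unnest_tokens(word, text)                   # the, quick, brown, fox
df %>% unnest_tokens(word, text, to_lower = FALSE) # The, quick, BROWN, fox
```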


Back to top

TIDY TEXT FLOW

In general, we

  1. start with Text Data
  2. un-nest the tokens to produce the first iteration of Tidy Text
  3. clean the Tidy Text as required
  4. summarize the Tidy Text into a first iteration of Summarized Text
  5. clean and analyze the Summarized Text
  6. visualize and present the Text Mining results

Back to top

TIDY TEXT ANALYSIS - SHAKESPEARE

Let's illustrate the flow with some of Shakespeare's plays.

(Gutenberg Project ID - Romeo and Juliet: 1112, Hamlet: 1524, Macbeth: 2264, A Midsummer Night's Dream: 2242, etc.)

References: http://www.gutenberg.org/ebooks/search/?query=Shakespeare

In [4]:
library(gutenbergr)
will_shakespeare <-gutenberg_download(c(1112,1524,2264,2242,2267,1120,1128,2243,23042,1526,1107,2253,1121,1103,2240,2268,1535,1126,1539,23046,1106,2251,2250,1790,2246,1114,1108,2262,1109,1537))
Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
Using mirror http://aleph.gutenberg.org

Now we produce a Tidy Text dataset

In [5]:
library(stringr) # necessary to use str_extract

tidy_ws <- will_shakespeare %>% 
  unnest_tokens(word,text) %>%
  mutate(word = str_extract(word,"[a-z']+")) %>% # to make sure we're not picking up stray punctuation and odd encodings
  anti_join(stop_words) %>%  # removing the stop words in the tidytext dataset stop_words
  na.omit() # remove NAs

head(tidy_ws)
Joining, by = "word"
Out[5]:
gutenberg_id word
1103 etext
1103 copyright
1103 implications
1103 read
1103 electronic
1103 version

For which we can easily produce a word count:

In [6]:
tidy_ws %>% 
  count(word, sort=TRUE)

library(ggplot2)

tidy_ws %>% 
  count(word, sort=TRUE) %>%
  filter(n > 500) %>%
  mutate(word=reorder(word,n)) %>%
  ggplot(aes(word,n)) + 
    geom_col() +
    xlab("Frequent words in selected Shakespeare plays") +
    ylab("Word count") +
    coord_flip()
Out[6]:
word n
thou 4268
thy 2978
thee 2511
sir 2332
lord 2211
enter 2050
haue 1718
hath 1519
king 1519
love 1179
tis 1142
time 1022
mine 880
duke 789
heart 758
day 738
exeunt 732
night 726
art 704
doth 694
life 674
father 658
death 650
true 648
exit 645
hand 645
world 620
sweet 607
speak 604
pray 601
yest 1
yesterdayes 1
ymantled 1
ynch 1
yoak'd 1
yokefellow 1
yokefellowes 1
yonds 1
yongrey 1
yorick 1
yorick's 1
yorkes 1
yorkshire 1
youl'd 1
young'st 1
younglings 1
youngly 1
ysicles 1
yslaked 1
zanies 1
zany 1
zeales 1
zed 1
zenelophon 1
zenith 1
zo 1
zodiac 1
zodiacs 1
zone 1
zwagger'd 1
Out[6]:

Back to top

COMPARISON WITH MARLOWE

Let's do the same for Christopher Marlowe, a contemporary of William Shakespeare

In [7]:
kit_marlowe <-gutenberg_download(c(779,1094,901,20288,16169,1589,1496,18781))
In [8]:
tidy_km <- kit_marlowe %>% 
  unnest_tokens(word,text) %>%
  mutate(word = str_extract(word,"[a-z']+")) %>% # to make sure we're not picking up stray punctuation and odd encodings
  anti_join(stop_words) %>%  # removing the stop words in the tidytext dataset stop_words
  na.omit() # remove NAs

head(tidy_km)
Joining, by = "word"
Out[8]:
gutenberg_id word
779 tragical
779 history
779 doctor
779 faustus
779 christopher
779 marlowe

We'll look at both of these datasets simultaneously. In order to do so, we'll build a word_count data set with the help of the pipeline operator. One of its advantages is that we can build the query sequentially and easily see the output at various stages.

We'll start by binding tidy_ws and tidy_km into a single dataset.

In [9]:
library(tidyr)
library(stringr) # for str_extract

word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) 

head(word_count)
tail(word_count)
Out[9]:
gutenberg_id word author
1103 etext WillShakespeare
1103 copyright WillShakespeare
1103 implications WillShakespeare
1103 read WillShakespeare
1103 electronic WillShakespeare
1103 version WillShakespeare
Out[9]:
gutenberg_id word author
20288 mine KitMarlowe
20288 eyes KitMarlowe
20288 witness KitMarlowe
20288 grief KitMarlowe
20288 innocency KitMarlowe
20288 exeunt KitMarlowe

Then we'll execute a word count for each of the authors (note the sorting of the outputs, and the new field $n$).

In [10]:
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable which identifies the author
   count(author,word)

head(word_count)
tail(word_count)
Out[10]:
author word n
KitMarlowe ab 1
KitMarlowe abandon 3
KitMarlowe abandon'd 1
KitMarlowe abandons 1
KitMarlowe abate 2
KitMarlowe abated 1
Out[10]:
author word n
WillShakespeare zo 1
WillShakespeare zodiac 1
WillShakespeare zodiacs 1
WillShakespeare zone 1
WillShakespeare zounds 7
WillShakespeare zwagger'd 1

To follow the tidy approach, we need word_count to have a unique value for each word for each author. In this case, grouping by author won't have an effect (why?), but let's add the line anyway for completeness' sake.
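While the group_by(author) line leaves the counts themselves untouched, it does change how subsequent per-group operations such as mutate() are computed. A toy sketch:

```r
library(dplyr)

toy <- tibble(author = c("A", "A", "B"), n = c(2, 2, 4))

# ungrouped: sum(n) is the grand total (8)
toy %>% mutate(p = n / sum(n))                      # 0.25 0.25 0.50

# grouped: sum(n) is computed within each author
toy %>% group_by(author) %>% mutate(p = n / sum(n)) # 0.50 0.50 1.00
```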

In [11]:
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable which identifies the author
  count(author,word) %>% # count the word for each author's work
  group_by(author)

head(word_count)
tail(word_count)
Out[11]:
author word n
KitMarlowe ab 1
KitMarlowe abandon 3
KitMarlowe abandon'd 1
KitMarlowe abandons 1
KitMarlowe abate 2
KitMarlowe abated 1
Out[11]:
author word n
WillShakespeare zo 1
WillShakespeare zodiac 1
WillShakespeare zodiacs 1
WillShakespeare zone 1
WillShakespeare zounds 7
WillShakespeare zwagger'd 1

The sizes of the two datasets differ, as we're using a larger number of Shakespeare plays. Rather than look at raw counts (which would naturally favour the Bard's output), we'll look at proportions: $$\frac{\mbox{number of occurrences of a specific term in the dataset}}{\mbox{total number of terms in the dataset}} =\frac{n}{\sum n}$$

In [12]:
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable which identifies the author
  count(author,word) %>% # count the word for each author's work
  group_by(author) %>% # provide the output for each author
  mutate(proportion = n / sum(n))

head(word_count)
tail(word_count)
Out[12]:
author word n proportion
KitMarlowe ab 1 1.491291e-05
KitMarlowe abandon 3 4.473873e-05
KitMarlowe abandon'd 1 1.491291e-05
KitMarlowe abandons 1 1.491291e-05
KitMarlowe abate 2 2.982582e-05
KitMarlowe abated 1 1.491291e-05
Out[12]:
author word n proportion
WillShakespeare zo 1 3.206022e-06
WillShakespeare zodiac 1 3.206022e-06
WillShakespeare zodiacs 1 3.206022e-06
WillShakespeare zone 1 3.206022e-06
WillShakespeare zounds 7 2.244216e-05
WillShakespeare zwagger'd 1 3.206022e-06
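As a quick sanity check on these figures: a word that appears once (n = 1) has proportion 1 / (total token count), so the totals can be recovered directly from the proportions printed above.

```r
# Implied total token counts, recovered from the proportions above:
1 / 1.491291e-05  # ~67,056 tokens in the Marlowe dataset
1 / 3.206022e-06  # ~311,913 tokens in the Shakespeare dataset
```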

We can now remove the raw counts to focus on the proportions.

In [13]:
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable which identifies the author
  count(author,word) %>% # count the word for each author's work
  group_by(author) %>% # provide the output for each author
  mutate(proportion = n / sum(n)) %>% # rather than raw counts, look at proportion of usage
  select(-c(n))

head(word_count)
tail(word_count)
Out[13]:
author word proportion
KitMarlowe ab 1.491291e-05
KitMarlowe abandon 4.473873e-05
KitMarlowe abandon'd 1.491291e-05
KitMarlowe abandons 1.491291e-05
KitMarlowe abate 2.982582e-05
KitMarlowe abated 1.491291e-05
Out[13]:
author word proportion
WillShakespeare zo 3.206022e-06
WillShakespeare zodiac 3.206022e-06
WillShakespeare zodiacs 3.206022e-06
WillShakespeare zone 3.206022e-06
WillShakespeare zounds 2.244216e-05
WillShakespeare zwagger'd 3.206022e-06

We reshape the word_count dataset to facilitate the analysis: each word is now represented by a row, and the proportion of the time it appears in each author's writings is shown in the columns.

In [14]:
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable which identifies the author
  count(author,word) %>% # count the word for each author's work
  group_by(author) %>% # provide the output for each author
  mutate(proportion = n / sum(n)) %>% # rather than raw counts, look at proportion of usage
  select(-c(n)) %>% # removes the count (keeps the proportion)
  spread(author,proportion)

head(word_count,20)
tail(word_count)
Out[14]:
word KitMarlowe WillShakespeare
a'kin NA 3.206022e-06
a'leven NA 3.206022e-06
a'th NA 3.847227e-05
a'that NA 3.206022e-06
aand NA 3.206022e-06
aaron NA 3.077781e-04
aaron's NA 3.206022e-06
ab 1.491291e-05 NA
abandon 4.473873e-05 1.282409e-05
abandon'd 1.491291e-05 1.603011e-05
abandoned NA 6.412044e-06
abandons 1.491291e-05 NA
abase NA 3.206022e-06
abash'd NA 3.206022e-06
abate 2.982582e-05 4.488431e-05
abated 1.491291e-05 6.412044e-06
abatement NA 6.412044e-06
abatements NA 3.206022e-06
abates NA 3.206022e-06
abb 1.491291e-05 5.129636e-05
Out[14]:
word KitMarlowe WillShakespeare
zoons 1.491291e-05 NA
zor 1.491291e-05 NA
zounds NA 2.244216e-05
zula 1.491291e-05 NA
zur 1.491291e-05 NA
zwagger'd NA 3.206022e-06

Let's see what proportion of each author's output is not found in the other's:

In [15]:
(WS_nKM <- sum(word_count$WillShakespeare[is.na(word_count$KitMarlowe)])) # proportion of Shakespeare's output not in Marlowe
(KM_nWS <- sum(word_count$KitMarlowe[is.na(word_count$WillShakespeare)])) # proportion of Marlowe's output not in Shakespeare
Out[15]:
0.266401208029162
Out[15]:
0.18846933905989

Finally, we re-organize the table for use with ggplot() (strictly speaking, this step is not mandatory).

In [16]:
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable which identifies the author
  count(author,word) %>% # count the word for each author's work
  group_by(author) %>% # provide the output for each author
  mutate(proportion = n / sum(n)) %>% # rather than raw counts, look at proportion of usage
  select(-c(n)) %>% # removes the count (keeps the proportion)
  spread(author,proportion) %>% # reshapes the table in a tidy format 
  gather(author, proportion, `WillShakespeare`)

head(word_count)
tail(word_count)
Out[16]:
word KitMarlowe author proportion
a'kin NA WillShakespeare 3.206022e-06
a'leven NA WillShakespeare 3.206022e-06
a'th NA WillShakespeare 3.847227e-05
a'that NA WillShakespeare 3.206022e-06
aand NA WillShakespeare 3.206022e-06
aaron NA WillShakespeare 3.077781e-04
Out[16]:
word KitMarlowe author proportion
zoons 1.491291e-05 WillShakespeare NA
zor 1.491291e-05 WillShakespeare NA
zounds NA WillShakespeare 2.244216e-05
zula 1.491291e-05 WillShakespeare NA
zur 1.491291e-05 WillShakespeare NA
zwagger'd NA WillShakespeare 3.206022e-06
In [17]:
library(scales)

ggplot(word_count, aes(x = proportion, y = `KitMarlowe`, color = abs(`KitMarlowe` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
  scale_y_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
  scale_color_gradient(limits = c(0, 0.001), low = "red", high = "gray75") +
  theme(legend.position="none") +
  labs(y = "Kit Marlowe", x = "Will Shakespeare")
Warning message:
“Removed 31721 rows containing missing values (geom_point).”
Warning message:
“Removed 31554 rows containing missing values (geom_text).”
Out[17]:

Words near the straight line are used with roughly the same frequency by both authors. For instance: lord, king, thou in the high-frequency spectrum and beasts, estate and glory in the low-frequency spectrum.

Words away from the straight line are used more frequently by one of the authors: lady and caesar seem to be used relatively more often by Shakespeare than by Marlowe, and aeneas and doctor are in the opposite situation (these terms are specific to plays).

The colour is related to the distance between the relative frequencies of a term for each author (red is close, gray is far). What could explain the shape of the red cloud (large at the bottom, thin at the top)?

Note the presence of both beasts and beast -- what does that tell you about the texts? And are you surprised by the prevalence of terms like enter, exit and exeunt?

Finally, let's see if we can quantify the similarity in word usage.

In [18]:
cor.test(data = word_count, ~ proportion + `KitMarlowe`)
Out[18]:
	Pearson's product-moment correlation

data:  proportion and KitMarlowe
t = 146.36, df = 8431, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8409551 0.8530152
sample estimates:
      cor 
0.8470942 

There's a fairly strong correlation (approximately 0.847) between the relative term frequencies for the two wordsmiths (among those terms which are found in both bodies of text -- see KM_nWS and WS_nKM). That's not entirely unexpected, since they were contemporaries: one would naively predict that the depth of their vocabulary and the way they deployed it would be linked, to some extent.

But without comparisons to other texts, it's hard to really put this value in perspective.


Back to top

COMPARISON WITH A MODERN TEXT

Let's see how Shakespeare and Marlowe compare to a modern body of work: the AP recaps dataset.

In [19]:
# Import text data
recaps <- read.csv(file="Data/Recap_data.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)

# Isolate text from recaps: AP.recaps
AP.recaps <- recaps$AP_Recap
In [20]:
# Cast the data in a data frame
recaps.df <- data.frame(text=AP.recaps, stringsAsFactors = FALSE)
In [21]:
# Create a tidytext dataset
tidy_AP <- recaps.df %>% 
  unnest_tokens(word,text) %>%
  mutate(word = str_extract(word,"[a-z']+")) %>% # to make sure we're not picking up stray punctuation and odd encodings
  anti_join(stop_words) %>%  # removing the stop words in the tidytext dataset stop_words
  na.omit() # remove NAs

head(tidy_AP) # inspect
Joining, by = "word"
Out[21]:
word
1 auston
2 matthews
4 minutes
5 nhl
6 record
7 book

As demonstrated above, the tidy structure can easily be meshed with ggplot2 to produce insightful graphics.

In [22]:
word_count_2 <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_AP,author="AP_recaps")) %>% # create a new variable which identifies the author
  count(author,word) %>% # count the word for each author's work
  group_by(author) %>% # provide the output for each author
  mutate(proportion = n / sum(n)) %>% # rather than raw counts, look at proportion of usage
  select(-c(n)) %>% # removes the count (keeps the proportion)
  spread(author,proportion) %>% # reshapes the table in a tidy format 
  gather(author, proportion, `WillShakespeare`)

word_count_3 <- bind_rows(mutate(tidy_km,author="KitMarlowe"),mutate(tidy_AP,author="AP_recaps")) %>% # create a new variable which identifies the author
  count(author,word) %>% # count the word for each author's work
  group_by(author) %>% # provide the output for each author
  mutate(proportion = n / sum(n)) %>% # rather than raw counts, look at proportion of usage
  select(-c(n)) %>% # removes the count (keeps the proportion)
  spread(author,proportion) %>% # reshapes the table in a tidy format 
  gather(author, proportion, `KitMarlowe`)

ggplot(word_count_2, aes(x = proportion, y = `AP_recaps`, color = abs(`AP_recaps` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
  scale_y_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
  scale_color_gradient(limits = c(0, 0.001), low = "blue", high = "gray75") +
  theme(legend.position="none") +
  labs(y = "AP Recaps", x = "Will Shakespeare")

ggplot(word_count_3, aes(x = proportion, y = `AP_recaps`, color = abs(`AP_recaps` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
  scale_y_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
  scale_color_gradient(limits = c(0, 0.001), low = "green", high = "gray75") +
  theme(legend.position="none") +
  labs(y = "AP Recaps", x = "Kit Marlowe")
Warning message:
“Removed 30490 rows containing missing values (geom_point).”
Warning message:
“Removed 30443 rows containing missing values (geom_text).”
Warning message:
“Removed 14759 rows containing missing values (geom_point).”
Warning message:
“Removed 14700 rows containing missing values (geom_text).”
Out[22]:
Out[22]:

The correlation computations can be done as above.

In [23]:
(WS_nAP <- sum(word_count_2$proportion[is.na(word_count_2$AP_recaps)])) # proportion of Shakespeare's output not in AP Recaps
(AP_nWS <- sum(word_count_2$AP_recaps[is.na(word_count_2$proportion)])) # proportion of AP Recaps' output not in Shakespeare
Out[23]:
0.766380368884913
Out[23]:
0.408676551345415
In [24]:
(KM_nAP <- sum(word_count_3$proportion[is.na(word_count_3$AP_recaps)])) # proportion of Marlowe's output not in AP Recaps
(AP_nKM <- sum(word_count_3$AP_recaps[is.na(word_count_3$proportion)])) # proportion of AP Recaps' output not in Marlowe
Out[24]:
0.774054521593892
Out[24]:
0.536847885777046
In [25]:
cor.test(data = word_count_2, ~ proportion + `AP_recaps`)
Out[25]:
	Pearson's product-moment correlation

data:  proportion and AP_recaps
t = 2.7923, df = 1687, p-value = 0.005292
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.02019593 0.11515174
sample estimates:
       cor 
0.06782743 
In [26]:
cor.test(data = word_count_3, ~ proportion + `AP_recaps`)
Out[26]:
	Pearson's product-moment correlation

data:  proportion and AP_recaps
t = 1.2037, df = 1199, p-value = 0.2289
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02186779  0.09112819
sample estimates:
       cor 
0.03474123 

Back to top

$n$-GRAMS

We have been using word, term, token, unit interchangeably when analyzing text up to now, as befits the Bag of "Words" approach.

It's not too hard, however, to think of applications where the basic numerical unit is not the relative frequency of single words but rather the links between 2 or more words, in succession or in co-occurrence.

Rather than tokenize some text by words, we can tokenize it by series of $n$ consecutive words (or $n$-grams).

Are there interesting bigrams in Shakespeare's plays? What would you expect the most common bigrams to be?

In [27]:
library(stringr) # necessary to use str_extract

tidy_ws.2 <- will_shakespeare %>% 
  unnest_tokens(bigram,text,token="ngrams",n=2) %>% # tokenize on bigrams
  #mutate(bigram = str_extract(bigram,"[0-9a-zA-Z'\ ]+")) %>% # to make sure we're not picking up stray punctuation and odd encodings
  count(bigram,sort=TRUE) # produce a count and sort on decreasing frequency

tidy_ws.2
Out[27]:
bigram n
i am 1520
of the 1462
in the 1438
my lord 1323
i will 1277
to the 1234
it is 951
i have 899
to be 776
that i 750
and the 637
is the 628
i do 626
you are 624
and i 598
of my 597
if you 564
i would 563
by the 552
i know 541
i haue 529
he is 482
is a 480
for the 479
with the 462
with a 461
of this 460
is not 457
of his 452
as i 451
zeale and 1
zeale both 1
zeale but 1
zeale king 1
zeale my 1
zeale tis 1
zeale to 1
zeale wanted 1
zeale would 1
zeales we 1
zealous contemplation 1
zealous feruour 1
zealous laughter 1
zed thou 1
zenelophon and 1
zenith doth 1
zir come 1
zir without 1
zo long 1
zodiac in 1
zodiacs have 1
zone make 1
zounds a 1
zounds consort 1
zounds he 1
zounds i'll 1
zounds tis 1
zounds who 1
zounds ye 1
zwagger'd out 1

At first glance, among the top 10 most frequent bigrams, only one conveys even a sliver of information: my lord. Everything else is stopword material.

However, what about the 9th most frequent bigram? In a general context, to be is "stopword" material -- but there is at least one famous instance in this context ("to be, or not to be") where that bigram is emphatically not just a "stopword".

Removing bigram stopwords is simple, although not as straightforward as in the unigram case:

  1. split each bigram into 2 columns, one per component word,
  2. check whether each component, separately, is a regular stopword, and
  3. remove the bigrams for which either component is a stopword.

For the sake of this exercise, let's also remove words related to the printing business, and theatre terms.

In [28]:
library(tidyr) # for separate() and unite()

# we will append a number of words to the stop_words dataset
word = c("gutenberg","shakespeare","  ","etext","1990","1993","public","print","copies","membership"
         ,"commercial","commercially","electronic","download","distribution" 
         ,"ff","f1","f2","f3","f4","NA","collier","ms","cap","txt","zip"
         ,"library","printed", "text","editions"
         ,"executive", "pobox", "fees", "million", "ascii", "legal", "61825", "2782" 
         ,"director", "machine","readable","carnegie","mellon","university"
         #,"exit", "exeunt", "enter", "scene", "act", "folio", "dramatis"
         #,"mine","tis", "thine","thy", "thou","art","hast", "shalt","dost","thee"
         #,"act_4","act_1","act_2","act_3","act_5","sc_1","sc_2","sc_3","sc_4","sc_5"
         #,"sc_6","sc_7","sc_8","sc_9","sc_10","sc_11"
            )
lexicon = rep("modern",length(word)) # let's call it the modern lexicon
addition = data.frame(word,lexicon)
stop_words_ws = rbind(stop_words,addition)

tidy_ws.2_cleaned <- tidy_ws.2 %>%
  separate(bigram, c("FirstTerm","SecondTerm"), sep=" ") %>% # separate the bigrams on the space character
  filter(!FirstTerm %in% stop_words_ws$word) %>% # only retain those rows for which 1st/2nd term is not in stop_words 
  filter(!SecondTerm %in% stop_words_ws$word) %>% # or rather, for which it is false that 1st/2nd term is in stop_word)
  unite(bigram,FirstTerm,SecondTerm, sep=" ") 


tidy_ws.2_cleaned
Out[28]:
bigram n
thou art 403
thou hast 301
sir toby 199
art thou 175
thou shalt 172
exeunt scene 167
king richard 167
dost thou 163
thou wilt 117
sir andrew 116
hast thou 113
queen elizabeth 104
wilt thou 98
thou dost 97
scene ii 83
pray thee 81
exeunt enter 78
thou canst 77
mine eyes 73
lord ham 69
scene iii 67
mine owne 65
house enter 64
thy hand 59
thy life 56
didst thou 53
thou didst 51
exit scene 50
thy selfe 48
scene iv 47
youthful gentleman 1
youthful hose 1
youthful lord 1
youthful lover 1
youthful phoebus 1
youthful season 1
youthful travel 1
youthful troilus 1
youthful valentine 1
youthful wages 1
youthfull blood 1
youthfull goates 1
youthfull parcell 1
youthfull spirit 1
yron armes 1
yslaked hath 1
zanies olivia 1
zeal gloucester 1
zeal loath 1
zeal strives 1
zeale king 1
zeale tis 1
zealous contemplation 1
zealous feruour 1
zealous laughter 1
zed thou 1
zenith doth 1
zounds consort 1
zounds tis 1
zounds ye 1

We can do a count by book too!

In [29]:
library(stringr) # necessary to use str_extract

tidy_ws.2.by_book <- will_shakespeare %>% 
  unnest_tokens(bigram,text,token="ngrams",n=2) %>% # tokenize on bigrams
  count(gutenberg_id,bigram,sort=TRUE) %>% # produce a count and sort on decreasing frequency
  separate(bigram, c("FirstTerm","SecondTerm"), sep=" ") %>% # separate the bigrams on the space character
  filter(!FirstTerm %in% stop_words$word) %>% # only retain those rows for which 1st/2nd term is not in stop_words 
  filter(!SecondTerm %in% stop_words$word) %>% # or rather, for which it is false that 1st/2nd term is in stop_word)
  unite(bigram,FirstTerm,SecondTerm, sep=" ") # re-unite the bigrams

head(tidy_ws.2.by_book)
Out[29]:
gutenberg_id bigram n
1526 sir toby 199
1103 king richard 151
1526 sir andrew 116
1103 queen elizabeth 104
23046 f2 f3 101
23046 f3 f4 98

Can we find the most common Nobles in each book? Yes. We. Can.

In [30]:
tidy_ws.2.by_book_nobles <- will_shakespeare %>% 
  unnest_tokens(bigram,text,token="ngrams",n=2) %>% # tokenize on bigrams
  count(gutenberg_id,bigram,sort=TRUE) %>% # produce a count and sort on decreasing frequency
  separate(bigram, c("FirstTerm","SecondTerm"), sep=" ") %>% # separate the bigrams on the space character
  filter(!FirstTerm %in% stop_words$word) %>% # only retain those rows for which 1st/2nd term is not in stop_words 
  filter(!SecondTerm %in% stop_words$word) %>% # or rather, for which it is false that 1st/2nd term is in stop_word)
  filter(FirstTerm %in% c("lord","queen","sir","duke","king")) %>% # which nobles are commonly mentioned in each play?
  unite(bigram,FirstTerm,SecondTerm, sep=" ") # re-unite the bigrams


tidy_ws.2.by_book_nobles
Out[30]:
gutenberg_id bigram n
1526 sir toby 199
1103 king richard 151
1526 sir andrew 116
1103 queen elizabeth 104
1524 lord ham 69
1103 queen margaret 36
1121 duke senior 35
2251 sir iohn 24
1126 lord angelo 23
1103 lord hastings 21
1103 king edward 20
1103 lord king 20
2267 lord oth 19
1108 sir proteus 17
1120 lord brutus 17
1526 sir topas 16
2262 lord tim 16
2250 king richard 15
2262 lord timon 15
1108 sir thurio 14
1108 sir valentine 12
2268 lord ant 12
1109 sir nathaniel 11
1114 lord bassanio 11
1524 lord hamlet 11
1539 lord leontes 11
1524 lord pol 10
2251 sir walter 10
2262 lord timons 10
1103 lord mayor 9
23042 sir weigh 1
23046 duke 372 1
23046 duke aegeon 1
23046 duke attended 1
23046 duke behold 1
23046 duke menaphon 1
23046 lord _aege 1
23046 lord _dro 1
23046 sir _aege 1
23046 sir _dro 1
23046 sir 5 1
23046 sir call'd 1
23046 sir dispatch 1
23046 sir draw 1
23046 sir f1 1
23046 sir f2 1
23046 sir fare 1
23046 sir ff 1
23046 sir hold 1
23046 sir learn 1
23046 sir lo 1
23046 sir malone 1
23046 sir pope 1
23046 sir quoth 1
23046 sir receive 1
23046 sir reverence 1
23046 sir sir 1
23046 sir sooth 1
23046 sir steevens 1
23046 sir tis 1

We can also use bigrams as the tokens when computing tf-idf in Shakespeare's plays, providing us with the bigrams that have document-specific importance within the corpus.
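Recall how tf-idf is defined (as bind_tf_idf() computes it, with a natural-log idf) for a term $t$ in a document $d$: $$\mbox{tf-idf}(t,d)=\mbox{tf}(t,d)\times\mbox{idf}(t), \qquad \mbox{idf}(t)=\ln\frac{N}{n_t},$$ where $\mbox{tf}(t,d)$ is the proportion of tokens of $d$ equal to $t$, $N$ is the number of documents (here, 30 plays), and $n_t$ is the number of documents containing $t$. A bigram that appears in a single play thus gets $\mbox{idf}=\ln 30\approx 3.40$, while a bigram that appears in every play gets $\mbox{idf}=0$.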

In [31]:
ws_tf_idf <- tidy_ws.2.by_book %>% 
  count(gutenberg_id,bigram) %>% # count the bigrams per book
  bind_tf_idf(bigram,gutenberg_id,nn) %>% # compute the tf-idf using bigrams as tokens
  arrange(desc(tf_idf)) # sort by highest tf_idf

ws_tf_idf
Out[31]:
gutenberg_id bigram nn tf idf tf_idf
1537 0 lady 1 0.0004149378 3.401197 0.001411285
1537 11 o'erspread 1 0.0004149378 3.401197 0.001411285
1537 abide till 1 0.0004149378 3.401197 0.001411285
1537 aboard suddenly 1 0.0004149378 3.401197 0.001411285
1537 abominable boult 1 0.0004149378 3.401197 0.001411285
1537 absence helcanus 1 0.0004149378 3.401197 0.001411285
1537 absolute marina 1 0.0004149378 3.401197 0.001411285
1537 accept feast 1 0.0004149378 3.401197 0.001411285
1537 acquaintance lies 1 0.0004149378 3.401197 0.001411285
1537 act twould 1 0.0004149378 3.401197 0.001411285
1537 actions blacker 1 0.0004149378 3.401197 0.001411285
1537 add ill 1 0.0004149378 3.401197 0.001411285
1537 add sorrow 1 0.0004149378 3.401197 0.001411285
1537 admired lays 1 0.0004149378 3.401197 0.001411285
1537 adventurous worth 1 0.0004149378 3.401197 0.001411285
1537 advice read 1 0.0004149378 3.401197 0.001411285
1537 advise exeunt 1 0.0004149378 3.401197 0.001411285
1537 aesculapius guide 1 0.0004149378 3.401197 0.001411285
1537 agan cleon 1 0.0004149378 3.401197 0.001411285
1537 aged patience 1 0.0004149378 3.401197 0.001411285
1537 air gentlemen 1 0.0004149378 3.401197 0.001411285
1537 air lysimachus 1 0.0004149378 3.401197 0.001411285
1537 air water 1 0.0004149378 3.401197 0.001411285
1537 alive behold 1 0.0004149378 3.401197 0.001411285
1537 all's effectless 1 0.0004149378 3.401197 0.001411285
1537 altar sacrifice 1 0.0004149378 3.401197 0.001411285
1537 altar true 1 0.0004149378 3.401197 0.001411285
1537 alter thy 1 0.0004149378 3.401197 0.001411285
1537 ancient gower 1 0.0004149378 3.401197 0.001411285
1537 ancient substitute1 0.0004149378 3.401197 0.001411285
2246 thou art 1 0.00026723680 0
2246 thou hast 1 0.00026723680 0
2246 thou shalt 1 0.00026723680 0
2250 thou art 1 0.00024084780 0
2250 thou hast 1 0.00024084780 0
2250 thou shalt 1 0.00024084780 0
2251 thou art 1 0.00023375410 0
2251 thou hast 1 0.00023375410 0
2251 thou shalt 1 0.00023375410 0
2253 thou art 1 0.00021146120 0
2253 thou hast 1 0.00021146120 0
2253 thou shalt 1 0.00021146120 0
2262 thou art 1 0.00027693160 0
2262 thou hast 1 0.00027693160 0
2262 thou shalt 1 0.00027693160 0
2264 thou art 1 0.00030084240 0
2264 thou hast 1 0.00030084240 0
2264 thou shalt 1 0.00030084240 0
2267 thou art 1 0.00024801590 0
2267 thou hast 1 0.00024801590 0
2267 thou shalt 1 0.00024801590 0
2268 thou art 1 0.00021925020 0
2268 thou hast 1 0.00021925020 0
2268 thou shalt 1 0.00021925020 0
23042 thou art 1 0.00017343050 0
23042 thou hast 1 0.00017343050 0
23042 thou shalt 1 0.00017343050 0
23046 thou art 1 0.00020084350 0
23046 thou hast 1 0.00020084350 0
23046 thou shalt 1 0.00020084350 0

Silge and Robinson provide functions for counting and visualizing bigram networks; below, the counting function is adapted to use our Shakespeare-specific stop words.

In [32]:
library(dplyr)
library(tidyr)
library(stringr) # str_detect is used below to drop bigrams containing digits
library(tidytext)
library(ggplot2)
library(igraph)
library(ggraph)

count_bigrams_ws <- function(dataset) {
    word = c("gutenberg","shakespeare","  ","etext","1990","1993","public","print","copies","membership"
         ,"commercial","commercially","electronic","download","distribution" 
         ,"ff","f1","f2","f3","f4","NA","collier","ms","cap","txt","zip"
         ,"library","printed", "text","editions"
         ,"executive", "pobox", "fees", "million", "ascii", "legal", "61825", "2782" 
         ,"director", "machine","readable","carnegie","mellon","university"
         ,"exit", "exeunt", "enter", "scene", "act", "folio", "dramatis"
         #,"mine","tis", "thine","thy", "thou","art","hast", "shalt","dost","thee"
         #,"act_4","act_1","act_2","act_3","act_5","sc_1","sc_2","sc_3","sc_4","sc_5"
         #,"sc_6","sc_7","sc_8","sc_9","sc_10","sc_11"
            )
    lexicon = rep("modern",length(word)) # let's call it the modern lexicon
    addition = data.frame(word,lexicon)
    stop_words_ws = rbind(stop_words,addition)
    
    dataset %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    separate(bigram, c("FirstTerm", "SecondTerm"), sep = " ") %>%
    filter(!FirstTerm %in% stop_words_ws$word,
           !SecondTerm %in% stop_words_ws$word) %>%
    count(FirstTerm, SecondTerm, sort = TRUE)
}

visualize_bigrams <- function(bigrams) {
  set.seed(2016) # fix the seed so the force-directed layout is reproducible
  a <- grid::arrow(type = "closed", length = grid::unit(0.15, "inches"))
  
  bigrams %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a) +
    geom_node_point(color = "lightblue", size = 5) +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
    theme_void()
}

ws_bigrams <- will_shakespeare %>% count_bigrams_ws() # count the bigrams with the ws stop words
ws_bigrams %>%
  filter(n > 40,
         !str_detect(FirstTerm, "\\d"),
         !str_detect(SecondTerm, "\\d")) %>%
  visualize_bigrams()
Attaching package: ‘igraph’

The following object is masked from ‘package:tidyr’:

    crossing

The following objects are masked from ‘package:dplyr’:

    as_data_frame, groups, union

The following objects are masked from ‘package:stats’:

    decompose, spectrum

The following object is masked from ‘package:base’:

    union

Out[32]:
[plot: directed network graph of Shakespeare bigrams occurring more than 40 times]
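The attach messages above illustrate the masking issue flagged at the top of the notebook: loading igraph after tidyr and dplyr hides their crossing and union. A minimal way to sidestep load-order problems is to qualify calls with an explicit namespace (a general R idiom, not specific to this analysis):

```r
# call the masked versions explicitly rather than relying on load order
tidyr::crossing(x = 1:2, y = c("a", "b")) # tidyr's crossing, not igraph's
dplyr::union(c(1, 2, 3), c(2, 3, 4))      # dplyr's union, not igraph's or base's
```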

Back to top

EXERCISES

  • Build a Shakespeare wordcloud for the Marlowe NAs, and vice versa.
  • Re-run the analyses after cleaning the datasets of stage directions (exit, exeunt, enter, scene, act, folio, dramatis, personae, etc.), of copyright/licensing information and other modern contaminating terms, and with a slightly more restrictive list of Early Modern English stop words.
  • Re-run the $n$-gram analysis for $n=3$.